I. Introduction

Alex on Kaggle has compiled datasets from sofifa.com relating to FIFA 22, the soccer video game made by EA Sports. We chose to analyze the players_fifa22.csv dataset, which has 19,260 rows and 90 columns. It mixes categorical variables such as a player’s name, country, club team, and position with over 70 quantitative variables, including overall rating, shooting rating, and how much their contract is worth.

In FIFA, the most important measure of a player’s performance is their Overall rating. We therefore use Overall as the response variable and model it with the quantitative variables Age, Height, ShootingTotal, PassingTotal, DefendingTotal, PhysicalityTotal, and ValueEUR and the categorical variable BestPosition. These variables provide a good base for a multiple linear regression model that predicts a player’s overall rating in the game.

library(tidyverse)
library(ggthemes)
library(car)
library(vip)
library(leaps)
library(HH)
library(qqplotr)
options(scipen = 9999)

theme_reach <- function() {
  theme_fivethirtyeight() +
    theme(
      legend.position = "none",
      plot.title = element_text(size = 22, hjust = 0.5, face = "bold"),
      plot.subtitle = element_text(size = 18, hjust = 0.5),
      axis.title.x = element_text(size=18),
      axis.title.y = element_text(size=18),
      axis.text = element_text(size = 14),
      strip.text = element_text(size = 16, face = "bold"),
      legend.text = element_text(size = 14),
    )
}

fifa_22 <- read_csv(url("https://raw.githubusercontent.com/tejseth/stats-401-project/master/players_fifa22.csv"))

II. Exploratory Data Analysis (EDA)

With a dataset of this size, we felt it was important to select only the variables we needed. In addition, since 19,260 rows is a lot of data to plot, we randomly sampled a quarter of it (4,815 rows) so that the relationships are easier to see. We then dropped players whose ValueEUR is 0, since the log transformation of ValueEUR used later is undefined for them.

fifa_small <- fifa_22 %>%
  dplyr::select(FullName, Age, Height, Weight, PhotoUrl, Overall, BestPosition, ValueEUR,
         ShootingTotal, PassingTotal, DefendingTotal, PhysicalityTotal)

fifa_smallest <- sample_n(fifa_small, 4815)

fifa_smallest <- fifa_smallest %>% filter(ValueEUR > 0)
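
Because sample_n() draws a different random quarter on every run, the plots and model output that follow will shift slightly between runs unless the random seed is fixed. The sketch below illustrates the idea in base R on a synthetic stand-in table. The `players` data frame, the `draw_quarter()` helper, and the seed value 401 are assumptions made for illustration and are not part of the original analysis.

```r
# Synthetic stand-in for fifa_small: one row per player (illustrative only).
players <- data.frame(id = 1:19260)

# Hypothetical helper: fix the seed, then draw a quarter of the rows.
draw_quarter <- function(data, seed) {
  set.seed(seed)                 # same seed -> same rows on every run
  n_keep <- nrow(data) %/% 4     # 19260 %/% 4 = 4815
  data[sample(nrow(data), n_keep), , drop = FALSE]
}

first_draw  <- draw_quarter(players, seed = 401)
second_draw <- draw_quarter(players, seed = 401)

nrow(first_draw)                          # a quarter of the 19,260 rows
identical(first_draw$id, second_draw$id)  # TRUE when the draw is reproducible
```

In the pipeline above, the same effect would come from calling set.seed() once before sample_n().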

Now that we have the subset of the dataset we want, we can start exploring it. Histograms show whether each variable is roughly normally distributed or skewed enough that it should be log transformed in the model.

fifa_smallest %>%
  ggplot(aes(x = Overall)) +
  geom_histogram(fill = "#012e67", alpha = 0.9, bins = 40) +
  scale_color_identity() +
  theme_reach() +
  labs(x = "Overall Rating",
       y = "Count",
       title = "Histogram of Overall Rating in FIFA 22") +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  scale_y_continuous(breaks = scales::pretty_breaks())

fifa_smallest %>%
  ggplot(aes(x = ValueEUR)) +
  geom_histogram(fill = "#012e67", alpha = 0.9, bins = 40) +
  scale_color_identity() +
  theme_reach() +
  labs(x = "Contract Value (Euros)",
       y = "Count",
       title = "Histogram of Contract Value in FIFA 22") +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  scale_y_continuous(breaks = scales::pretty_breaks())

fifa_smallest %>%
  ggplot(aes(x = ShootingTotal)) +
  geom_histogram(fill = "#012e67", alpha = 0.9, bins = 40) +
  scale_color_identity() +
  theme_reach() +
  labs(x = "Shooting Rating",
       y = "Count",
       title = "Histogram of Shooting Rating in FIFA 22") +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  scale_y_continuous(breaks = scales::pretty_breaks())

fifa_smallest %>%
  ggplot(aes(x = PassingTotal)) +
  geom_histogram(fill = "#012e67", alpha = 0.9, bins = 40) +
  scale_color_identity() +
  theme_reach() +
  labs(x = "Passing Rating",
       y = "Count",
       title = "Histogram of Passing Rating in FIFA 22") +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  scale_y_continuous(breaks = scales::pretty_breaks())

fifa_smallest %>%
  ggplot(aes(x = DefendingTotal)) +
  geom_histogram(fill = "#012e67", alpha = 0.9, bins = 40) +
  scale_color_identity() +
  theme_reach() +
  labs(x = "Defending Rating",
       y = "Count",
       title = "Histogram of Defending Rating in FIFA 22") +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  scale_y_continuous(breaks = scales::pretty_breaks())

fifa_smallest %>%
  ggplot(aes(x = PhysicalityTotal)) +
  geom_histogram(fill = "#012e67", alpha = 0.9, bins = 60) +
  scale_color_identity() +
  theme_reach() +
  labs(x = "Physicality Rating",
       y = "Count",
       title = "Histogram of Physicality Rating in FIFA 22") +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  scale_y_continuous(breaks = scales::pretty_breaks())
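
As a numeric companion to reading skew off a histogram, the sketch below computes sample skewness before and after a log transformation. It uses simulated log-normal values rather than the real ValueEUR column, and the `skewness()` helper, the seed, and the distribution parameters are all assumptions chosen for illustration.

```r
set.seed(22)  # arbitrary seed so the simulated values are reproducible

# Simulated right-skewed "contract values" (log-normal); not the real data.
value <- rlnorm(5000, meanlog = 13, sdlog = 1.2)

# Sample skewness: the third standardized moment (0 for a symmetric distribution).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(value)       # large and positive: long right tail
skew_log <- skewness(log(value))  # close to 0: roughly symmetric after the log

c(raw = skew_raw, logged = skew_log)
```

A strongly positive value that falls to roughly zero after log() is the numeric counterpart of a long right tail that disappears in the logged histogram.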

With the histograms done, we can also make scatterplots with each explanatory variable on the x-axis and the response variable, Overall, on the y-axis.

fifa_smallest %>%
  ggplot(aes(x = Age, y = Overall)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 1.2) +
  geom_point(fill = "darkorange", shape = 21, color = "black", alpha = 0.8, size = 3.5) +
  theme_reach() +
  labs(x = "Age (Years)",
       y = "Overall (0-100)",
       title = "Scatterplot of Age and Overall Rating")

fifa_smallest %>%
  ggplot(aes(x = Height, y = Overall)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 1.2) +
  geom_point(fill = "darkorange", shape = 21, color = "black", alpha = 0.8, size = 3.5) +
  theme_reach() +
  labs(x = "Height (Centimeters)",
       y = "Overall (0-100)",
       title = "Scatterplot of Height and Overall Rating")

fifa_smallest %>%
  ggplot(aes(x = ShootingTotal, y = Overall)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 1.2) +
  geom_point(fill = "darkorange", shape = 21, color = "black", alpha = 0.8, size = 3.5) +
  theme_reach() +
  labs(x = "Shooting Rating (0-100)",
       y = "Overall (0-100)",
       title = "Scatterplot of Shooting Rating and Overall Rating")

fifa_smallest %>%
  ggplot(aes(x = DefendingTotal, y = Overall)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 1.2) +
  geom_point(fill = "darkorange", shape = 21, color = "black", alpha = 0.8, size = 3.5) +
  theme_reach() +
  labs(x = "Defending Rating (0-100)",
       y = "Overall (0-100)",
       title = "Scatterplot of Defending Rating and Overall Rating")

fifa_smallest %>%
  ggplot(aes(x = PassingTotal, y = Overall)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 1.2) +
  geom_point(fill = "darkorange", shape = 21, color = "black", alpha = 0.8, size = 3.5) +
  theme_reach() +
  labs(x = "Passing Rating (0-100)",
       y = "Overall (0-100)",
       title = "Scatterplot of Passing Rating and Overall Rating")

fifa_smallest %>%
  ggplot(aes(x = PhysicalityTotal, y = Overall)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 1.2) +
  geom_point(fill = "darkorange", shape = 21, color = "black", alpha = 0.8, size = 3.5) +
  theme_reach() +
  labs(x = "Physicality Rating (0-100)",
       y = "Overall (0-100)",
       title = "Scatterplot of Physicality Rating and Overall Rating")

fifa_smallest %>%
  ggplot(aes(x = log(ValueEUR), y = Overall)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 1.2) +
  geom_point(fill = "darkorange", shape = 21, color = "black", alpha = 0.8, size = 3.5) +
  theme_reach() +
  labs(x = "Log-Transformed Contract Value (Euros)",
       y = "Overall (0-100)",
       title = "Scatterplot of Logged Contract Value and Overall Rating")

We also can make a boxplot to analyze the differences in Overall rating based on position.

fifa_smallest %>%
  group_by(BestPosition) %>%
  mutate(med = median(Overall)) %>%
  ungroup() %>% 
  ggplot(aes(x = fct_reorder(BestPosition, med), y = Overall)) +
  geom_boxplot(aes(fill = fct_reorder(BestPosition, med))) +
  scale_fill_viridis_d() +
  theme_reach() +
  labs(x = "Position",
       y = "Overall Rating",
       title = "Overall Rating by Position in FIFA 22",
       fill = "Position") +
  theme(axis.text.x = element_text(angle = -30))

An interaction term between ValueEUR and BestPosition is worth exploring: some positions matter more than others in soccer, so what clubs pay for a player depends on both skill level and the position they play.

fifa_smallest %>%
  ggplot(aes(x = log(ValueEUR), y = Overall, color = BestPosition)) +
  # Color alone distinguishes the 15 positions. Mapping shape as well drops
  # every position past the sixth, because the shape palette has 6 values.
  geom_point(size = 4) +
  geom_smooth(se = FALSE) +
  theme_reach() +
  theme(legend.position = "bottom") +
  labs(x = "Log-Transformed Contract Value (Euros)",
       y = "Overall (0-100)",
       title = "Scatterplot of Logged Contract Value and Overall Rating by Position")

In the context of the data, contract values vary significantly by position in soccer. As seen in the graph, defensive players and goalkeepers are typically valued much lower than center attacking midfielders and strikers. Including BestPosition together with its interaction with log(ValueEUR) lets each position have its own intercept and slope.
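
To make the role of the interaction term concrete, the sketch below fits a model of the same form on simulated data with two positions whose true slopes differ. The data, the variable names, and the slope values 4.0 and 5.0 are assumptions made for illustration, not estimates from the FIFA data.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data with two made-up positions whose slopes differ on purpose.
n        <- 400
position <- factor(rep(c("GK", "ST"), each = n / 2))
log_val  <- runif(n, 11, 18)
slope    <- ifelse(position == "GK", 4.0, 5.0)  # true slopes (assumed)
overall  <- 5 + slope * log_val + rnorm(n, sd = 1)

fit <- lm(overall ~ log_val * position)
b   <- coef(fit)

# The baseline level (GK) uses the main-effect slope;
# ST adds its interaction coefficient to that slope.
slope_gk <- unname(b["log_val"])
slope_st <- unname(b["log_val"] + b["log_val:positionST"])

round(c(GK = slope_gk, ST = slope_st), 2)
```

The recovered slopes should land close to the values used in the simulation, which is the sense in which the interaction lets the slope of log(ValueEUR) differ by position.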

III. Making the Model

The first step in building the model is to compare the adjusted R^2 of candidate subsets of predictors to make sure we’re not using too few or too many variables. As the plot above shows, the slope can differ by position, so we include an interaction term between ValueEUR and BestPosition.

fifa_full <- regsubsets(Overall ~ Age + Height + ShootingTotal + I(ShootingTotal^2) + 
                 DefendingTotal + I(DefendingTotal^2)  + PassingTotal + 
                PhysicalityTotal + log(ValueEUR) * BestPosition, data = fifa_smallest)

summaryHH(fifa_full)
##                                                  model p   rsq   rss adjr2
## 1                                             lg(VEUR) 2 0.788 47502 0.787
## 2                                           A-lg(VEUR) 3 0.967  7372 0.967
## 3                                       A-lg(VEUR)-BPG 4 0.969  6873 0.969
## 4                                     A-D-lg(VEUR)-BPG 5 0.970  6635 0.970
## 5                               A-PsT-PhT-lg(VEUR)-BPG 6 0.971  6470 0.971
## 6                  A-PsT-PhT-lg(VEUR)-BPG-l(VEUR):BPCB 7 0.971  6379 0.971
## 7              A-I(D-PsT-PhT-lg(VEUR)-BPG-l(VEUR):BPCM 8 0.972  6297 0.972
## 8 A-I(D-PsT-PhT-lg(VEUR)-BPG-l(VEUR):BPCD-l(VEUR):BPCM 9 0.972  6249 0.972
##      cp    bic stderr
## 1 32606  -7404   3.15
## 2  1018 -16321   1.24
## 3   627 -16648   1.20
## 4   442 -16809   1.18
## 5   314 -16921   1.16
## 6   245 -16980   1.15
## 7   182 -17034   1.15
## 8   146 -17062   1.14
## 
## Model variables with abbreviations
##                                                                                                                                                                                              model
## lg(VEUR)                                                                                                                                                                             log(ValueEUR)
## A-lg(VEUR)                                                                                                                                                                       Age-log(ValueEUR)
## A-lg(VEUR)-BPG                                                                                                                                                    Age-log(ValueEUR)-BestPositionGK
## A-D-lg(VEUR)-BPG                                                                                                                                   Age-DefendingTotal-log(ValueEUR)-BestPositionGK
## A-PsT-PhT-lg(VEUR)-BPG                                                                                                              Age-PassingTotal-PhysicalityTotal-log(ValueEUR)-BestPositionGK
## A-PsT-PhT-lg(VEUR)-BPG-l(VEUR):BPCB                                                                    Age-PassingTotal-PhysicalityTotal-log(ValueEUR)-BestPositionGK-log(ValueEUR):BestPositionCB
## A-I(D-PsT-PhT-lg(VEUR)-BPG-l(VEUR):BPCM                                            Age-I(DefendingTotal^2)-PassingTotal-PhysicalityTotal-log(ValueEUR)-BestPositionGK-log(ValueEUR):BestPositionCM
## A-I(D-PsT-PhT-lg(VEUR)-BPG-l(VEUR):BPCD-l(VEUR):BPCM Age-I(DefendingTotal^2)-PassingTotal-PhysicalityTotal-log(ValueEUR)-BestPositionGK-log(ValueEUR):BestPositionCDM-log(ValueEUR):BestPositionCM
## 
## model with largest adjr2
## 8 
## 
## Number of observations
## 4791

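We can see that the highest adjusted R^2 (0.972) belongs to the 8th model, which is the largest subset in the table because regsubsets() considers at most 8 terms by default. Cp and BIC are also still improving at that size, so we keep all of the candidate variables. Next, we can build the full model and check for multicollinearity. Since some variables have a quadratic term, we will leave those terms out when calculating the variance inflation factors.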
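
Before building the model, the subset comparison above can be made concrete with a small base R sketch. It enumerates every subset of four predictors from the built-in mtcars data and keeps the one with the highest adjusted R^2. The dataset and the choice of candidate columns are assumptions for illustration; this mirrors the idea of an exhaustive subset search and is not the regsubsets() implementation.

```r
# mtcars ships with R; mpg is the response, four columns are candidate predictors.
candidates <- c("wt", "hp", "qsec", "drat")

# Every non-empty subset of the candidates (2^4 - 1 = 15 models).
subsets <- unlist(
  lapply(seq_along(candidates),
         function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), read from summary().
adj_r2 <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
})

best <- subsets[[which.max(adj_r2)]]
best         # predictors in the subset with the highest adjusted R^2
max(adj_r2)
```

Unlike plain R^2, adjusted R^2 penalizes each extra predictor, so the largest subset does not win automatically.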

lm_fifa <- lm(Overall ~ Age + Height + ShootingTotal + I(ShootingTotal^2) + 
                 DefendingTotal + I(DefendingTotal^2)  + PassingTotal + 
                PhysicalityTotal + log(ValueEUR) * BestPosition, data = fifa_smallest)
                 
summary(lm_fifa)
## 
## Call:
## lm(formula = Overall ~ Age + Height + ShootingTotal + I(ShootingTotal^2) + 
##     DefendingTotal + I(DefendingTotal^2) + PassingTotal + PhysicalityTotal + 
##     log(ValueEUR) * BestPosition, data = fifa_smallest)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4128 -0.6646 -0.0027  0.6408  5.8193 
## 
## Coefficients:
##                                  Estimate  Std. Error t value
## (Intercept)                   -15.9839432   0.9830998 -16.259
## Age                             0.5308427   0.0055882  94.994
## Height                         -0.0028991   0.0033432  -0.867
## ShootingTotal                   0.0258604   0.0113007   2.288
## I(ShootingTotal^2)             -0.0001084   0.0001141  -0.950
## DefendingTotal                 -0.0131268   0.0095004  -1.382
## I(DefendingTotal^2)             0.0003349   0.0001082   3.094
## PassingTotal                    0.0426151   0.0042042  10.136
## PhysicalityTotal                0.0254218   0.0030276   8.397
## log(ValueEUR)                   4.5498323   0.0465035  97.838
## BestPositionCB                  2.1264355   0.7867742   2.703
## BestPositionCDM                 2.3363049   0.8829289   2.646
## BestPositionCF                 -5.8541138   5.9269086  -0.988
## BestPositionCM                  2.6067260   0.9202890   2.833
## BestPositionGK                  4.0468687   0.7445756   5.435
## BestPositionLB                  1.2261623   1.0538376   1.164
## BestPositionLM                 -1.0576878   1.1688897  -0.905
## BestPositionLW                  3.4082295   2.5844925   1.319
## BestPositionLWB                 1.7839193   1.6988739   1.050
## BestPositionRB                  3.3441120   0.9919090   3.371
## BestPositionRM                 -3.7393362   0.8921629  -4.191
## BestPositionRW                 -1.2277112   1.7384205  -0.706
## BestPositionRWB                -1.3717434   1.6583135  -0.827
## BestPositionST                 -1.3355784   0.7498780  -1.781
## log(ValueEUR):BestPositionCB   -0.1162556   0.0583676  -1.992
## log(ValueEUR):BestPositionCDM  -0.1780910   0.0640337  -2.781
## log(ValueEUR):BestPositionCF    0.4795322   0.4240500   1.131
## log(ValueEUR):BestPositionCM   -0.2180198   0.0647434  -3.367
## log(ValueEUR):BestPositionGK   -0.2193152   0.0541526  -4.050
## log(ValueEUR):BestPositionLB   -0.0642993   0.0764602  -0.841
## log(ValueEUR):BestPositionLM    0.0885914   0.0838773   1.056
## log(ValueEUR):BestPositionLW   -0.1975784   0.1819723  -1.086
## log(ValueEUR):BestPositionLWB  -0.1095306   0.1214700  -0.902
## log(ValueEUR):BestPositionRB   -0.2139471   0.0714688  -2.994
## log(ValueEUR):BestPositionRM    0.2887704   0.0640841   4.506
## log(ValueEUR):BestPositionRW    0.1057307   0.1242694   0.851
## log(ValueEUR):BestPositionRWB   0.1190034   0.1186133   1.003
## log(ValueEUR):BestPositionST    0.1131241   0.0529455   2.137
##                                           Pr(>|t|)    
## (Intercept)                   < 0.0000000000000002 ***
## Age                           < 0.0000000000000002 ***
## Height                                    0.385904    
## ShootingTotal                             0.022159 *  
## I(ShootingTotal^2)                        0.342252    
## DefendingTotal                            0.167125    
## I(DefendingTotal^2)                       0.001988 ** 
## PassingTotal                  < 0.0000000000000002 ***
## PhysicalityTotal              < 0.0000000000000002 ***
## log(ValueEUR)                 < 0.0000000000000002 ***
## BestPositionCB                            0.006902 ** 
## BestPositionCDM                           0.008170 ** 
## BestPositionCF                            0.323341    
## BestPositionCM                            0.004638 ** 
## BestPositionGK                        0.0000000575 ***
## BestPositionLB                            0.244677    
## BestPositionLM                            0.365583    
## BestPositionLW                            0.187325    
## BestPositionLWB                           0.293744    
## BestPositionRB                            0.000754 ***
## BestPositionRM                        0.0000282379 ***
## BestPositionRW                            0.480085    
## BestPositionRWB                           0.408170    
## BestPositionST                            0.074966 .  
## log(ValueEUR):BestPositionCB              0.046452 *  
## log(ValueEUR):BestPositionCDM             0.005437 ** 
## log(ValueEUR):BestPositionCF              0.258180    
## log(ValueEUR):BestPositionCM              0.000765 ***
## log(ValueEUR):BestPositionGK          0.0000520447 ***
## log(ValueEUR):BestPositionLB              0.400417    
## log(ValueEUR):BestPositionLM              0.290929    
## log(ValueEUR):BestPositionLW              0.277640    
## log(ValueEUR):BestPositionLWB             0.367257    
## log(ValueEUR):BestPositionRB              0.002771 ** 
## log(ValueEUR):BestPositionRM          0.0000067609 ***
## log(ValueEUR):BestPositionRW              0.394913    
## log(ValueEUR):BestPositionRWB             0.315773    
## log(ValueEUR):BestPositionST              0.032680 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.127 on 4753 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.9728 
## F-statistic:  4628 on 37 and 4753 DF,  p-value: < 0.00000000000000022
lm_fifa_vif <- lm(Overall ~ Age + Height +
                 DefendingTotal  + PassingTotal + PhysicalityTotal + 
                   log(ValueEUR) + BestPosition, data = fifa_smallest)

vif(lm_fifa_vif)
##              Age           Height   DefendingTotal     PassingTotal 
##         2.075440         1.957593         5.537207         4.806880 
## PhysicalityTotal    log(ValueEUR)   BestPositionCB  BestPositionCDM 
##         3.140977         3.216646         5.682797         2.208462 
##   BestPositionCF   BestPositionCM   BestPositionGK   BestPositionLB 
##         1.020249         1.663277         2.358784         1.777876 
##   BestPositionLM   BestPositionLW  BestPositionLWB   BestPositionRB 
##         1.343971         1.093095         1.221632         1.918832 
##   BestPositionRM   BestPositionRW  BestPositionRWB   BestPositionST 
##         1.604307         1.200794         1.283476         3.373610

In the output above, DefendingTotal (5.54) and the BestPositionCB indicator (5.68) have VIF scores over 5, and PassingTotal (4.81) is close, so we can assume that there is some multicollinearity among those variables. ShootingTotal is not part of the VIF model above, so its score is not shown. In the model summary, I(ShootingTotal^2) was not significant (p = 0.342) and ShootingTotal was only weakly significant (p = 0.022), so we remove both terms to reduce the overlap among the rating variables.
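
For reference, a VIF can be computed by hand: regress one predictor on all the others and take 1 / (1 - R^2). The sketch below does this on simulated predictors in base R. The variables x1, x2, and x3 and their correlation structure are assumptions for illustration, and the code is not the vif() implementation used above.

```r
set.seed(5)  # arbitrary seed for reproducibility

# Simulated predictors: x2 is built from x1, x3 is unrelated to both.
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors.
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

round(vif_by_hand, 2)  # x1 and x2 are inflated; x3 stays near 1
```

A score near 1 means a predictor is almost unrelated to the rest, while scores above the common cutoffs of 5 or 10 mean most of its variation is already explained by the other predictors.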

fifa_full <- lm(Overall ~ Age + Height + DefendingTotal + 
                        I(DefendingTotal^2)  + PassingTotal + PhysicalityTotal + 
                          log(ValueEUR) * BestPosition, data = fifa_smallest)

summary(fifa_full)
## 
## Call:
## lm(formula = Overall ~ Age + Height + DefendingTotal + I(DefendingTotal^2) + 
##     PassingTotal + PhysicalityTotal + log(ValueEUR) * BestPosition, 
##     data = fifa_smallest)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3187 -0.6601 -0.0104  0.6504  5.8192 
## 
## Coefficients:
##                                  Estimate  Std. Error t value
## (Intercept)                   -15.7451448   0.8109329 -19.416
## Age                             0.5348201   0.0051695 103.456
## Height                         -0.0026671   0.0033490  -0.796
## DefendingTotal                 -0.0204428   0.0093656  -2.183
## I(DefendingTotal^2)             0.0003938   0.0001070   3.679
## PassingTotal                    0.0527364   0.0036976  14.262
## PhysicalityTotal                0.0274004   0.0030098   9.104
## log(ValueEUR)                   4.5656033   0.0410505 111.219
## BestPositionCB                  2.1795143   0.7479181   2.914
## BestPositionCDM                 2.3534484   0.8769485   2.684
## BestPositionCF                 -5.4345224   5.9404805  -0.915
## BestPositionCM                  2.6463328   0.9225174   2.869
## BestPositionGK                  4.1489235   0.7450797   5.568
## BestPositionLB                  1.0839934   1.0366219   1.046
## BestPositionLM                 -0.9962207   1.1713227  -0.851
## BestPositionLW                  3.7167589   2.5900000   1.435
## BestPositionLWB                 1.5286992   1.6955128   0.902
## BestPositionRB                  3.1127323   0.9766934   3.187
## BestPositionRM                 -3.8254368   0.8928187  -4.285
## BestPositionRW                 -1.0677535   1.7425470  -0.613
## BestPositionRWB                -1.3970591   1.6523982  -0.845
## BestPositionST                 -1.1231816   0.7485461  -1.500
## log(ValueEUR):BestPositionCB   -0.1366353   0.0552776  -2.472
## log(ValueEUR):BestPositionCDM  -0.1865873   0.0634614  -2.940
## log(ValueEUR):BestPositionCF    0.4548015   0.4250167   1.070
## log(ValueEUR):BestPositionCM   -0.2242138   0.0648886  -3.455
## log(ValueEUR):BestPositionGK   -0.2249119   0.0542464  -4.146
## log(ValueEUR):BestPositionLB   -0.0658661   0.0750616  -0.877
## log(ValueEUR):BestPositionLM    0.0842171   0.0840398   1.002
## log(ValueEUR):BestPositionLW   -0.2157294   0.1823712  -1.183
## log(ValueEUR):BestPositionLWB  -0.1005754   0.1211645  -0.830
## log(ValueEUR):BestPositionRB   -0.2094248   0.0702438  -2.981
## log(ValueEUR):BestPositionRM    0.2950331   0.0640971   4.603
## log(ValueEUR):BestPositionRW    0.0968973   0.1245726   0.778
## log(ValueEUR):BestPositionRWB   0.1099945   0.1180144   0.932
## log(ValueEUR):BestPositionST    0.1061929   0.0528429   2.010
##                                           Pr(>|t|)    
## (Intercept)                   < 0.0000000000000002 ***
## Age                           < 0.0000000000000002 ***
## Height                                    0.425845    
## DefendingTotal                            0.029102 *  
## I(DefendingTotal^2)                       0.000236 ***
## PassingTotal                  < 0.0000000000000002 ***
## PhysicalityTotal              < 0.0000000000000002 ***
## log(ValueEUR)                 < 0.0000000000000002 ***
## BestPositionCB                            0.003584 ** 
## BestPositionCDM                           0.007307 ** 
## BestPositionCF                            0.360328    
## BestPositionCM                            0.004141 ** 
## BestPositionGK                        0.0000000271 ***
## BestPositionLB                            0.295754    
## BestPositionLM                            0.395085    
## BestPositionLW                            0.151341    
## BestPositionLWB                           0.367307    
## BestPositionRB                            0.001447 ** 
## BestPositionRM                        0.0000186623 ***
## BestPositionRW                            0.540068    
## BestPositionRWB                           0.397889    
## BestPositionST                            0.133555    
## log(ValueEUR):BestPositionCB              0.013478 *  
## log(ValueEUR):BestPositionCDM             0.003296 ** 
## log(ValueEUR):BestPositionCF              0.284638    
## log(ValueEUR):BestPositionCM              0.000554 ***
## log(ValueEUR):BestPositionGK          0.0000344048 ***
## log(ValueEUR):BestPositionLB              0.380263    
## log(ValueEUR):BestPositionLM              0.316342    
## log(ValueEUR):BestPositionLW              0.236902    
## log(ValueEUR):BestPositionLWB             0.406539    
## log(ValueEUR):BestPositionRB              0.002884 ** 
## log(ValueEUR):BestPositionRM          0.0000042748 ***
## log(ValueEUR):BestPositionRW              0.436703    
## log(ValueEUR):BestPositionRWB             0.351362    
## log(ValueEUR):BestPositionST              0.044531 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.13 on 4755 degrees of freedom
## Multiple R-squared:  0.9728, Adjusted R-squared:  0.9726 
## F-statistic:  4867 on 35 and 4755 DF,  p-value: < 0.00000000000000022
plot(fifa_full$fitted.values, fifa_full$residuals, 
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Fitted Value and Residuals Plot")

qqnorm(fifa_full$residuals, main = "QQ Plot of the Residuals")
qqline(fifa_full$residuals)

We use the residuals vs. fitted values plot above to check the assumptions of linearity and constant variance. For linearity, the data reasonably meet the requirement, since there is no clear pattern in the residuals other than a slight downward tail at the end. For constant variance, however, the data do not meet the requirement, since the vertical spread of the residuals changes when moving from left to right.
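
The visual check for constant variance could be supplemented with a formal test. The sketch below hand-computes Koenker's studentized variant of the Breusch-Pagan test in base R on simulated data whose error spread grows with the predictor. This test is an addition for illustration and was not part of the original analysis; the data and seed are assumptions.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Simulated data whose error spread grows with x (heteroscedastic by construction).
n <- 1000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor.
aux <- lm(resid(fit)^2 ~ x)

# Under constant variance, n * R^2 from the auxiliary regression is
# approximately chi-squared with 1 degree of freedom (one predictor).
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_pval)  # a small p-value points to non-constant variance
```

With several predictors, the auxiliary regression would include all of them and the degrees of freedom would equal their count.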

We use the QQ plot above to check the normality of the residuals. Overall, the residuals reasonably meet this requirement because the large majority of points fall on the reference line. We see some deviation near the beginning and the end of the plot, which is plausible: some players who were initially valued low end up having surprising breakout seasons, while others who were valued very highly have below-average seasons. As a result, we can say that the assumption of normality is met, and it makes sense that there is some deviation for very high and very low ratings.
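
One way to put a number on how closely points follow the QQ reference line is the correlation between the sorted values and the theoretical normal quantiles, which are the same pairs that qqnorm() plots. The sketch below compares simulated normal residuals with simulated heavy-tailed ones; the `qq_cor()` helper and both samples are assumptions for illustration.

```r
set.seed(11)  # arbitrary seed for reproducibility

# Correlation between sorted values and theoretical normal quantiles,
# i.e. the pairs of points that qqnorm() draws.
qq_cor <- function(res) {
  theoretical <- qnorm(ppoints(length(res)))
  cor(sort(res), theoretical)
}

normal_res <- rnorm(2000)       # simulated residuals that really are normal
heavy_res  <- rt(2000, df = 2)  # simulated residuals with heavy tails

c(normal = qq_cor(normal_res), heavy_tailed = qq_cor(heavy_res))
```

A value very close to 1 corresponds to points hugging the line, while heavy tails pull the correlation down.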

IV. Conclusion

Our overall goal was to create a regression model that predicts overall player rating in the FIFA 22 video game. In our exploratory analysis, most variables could be used as they were. The exceptions were ValueEUR, which needed a log transformation because it was highly right skewed, and the shooting and defending totals, which needed quadratic terms because their relationships with Overall were not linear. In addition, we included an interaction term between ValueEUR and BestPosition because the slopes vary by position, meaning the relationship between contract value and player rating depends on the position.

Next, we fit the model with all the transformations above and checked for multicollinearity by calculating the variance inflation factors to see whether any of them were above 10. None were that high, but DefendingTotal and the BestPositionCB indicator had values slightly above 5, which suggests a modest amount of multicollinearity. Since taking out ShootingTotal and its quadratic term barely changes the adjusted R^2 (0.9728 versus 0.9726), it was reasonable to drop them from the final model.

We were pleased with the results, as an adjusted R^2 of 0.9726 is very high for a linear regression model, especially one that doesn’t show strong signs of overfitting, multicollinearity, or overparameterization. For our diagnostic plots, we were satisfied with the assumptions of linearity and normality but not with the assumption of constant variance. For linearity, the residuals vs. fitted values plot reasonably met the requirement, as there wasn’t a clear and consistent pattern, and for normality the QQ plot didn’t have major deviations other than a few points at the beginning and end of the plot. For constant variance, the width of the band of residuals is clearly inconsistent from left to right, so we were not comfortable saying that this assumption was reasonably met.